Abstract

During mission execution in military applications, the TRADOC Pamphlet 525-66 Battle Command and Battle Space
Awareness capabilities prescribe expectations that networked teams will perform in a reliable manner under changing
mission requirements and changing team and individual objectives. In this paper we first present an overall view for
dynamical decision-making in teams, both cooperative and competitive. Strategies for team decision problems, including
optimal control, N-player games ($H_\infty$ control, non-zero sum) and so on are normally solved offline by solving associated
matrix equations such as the coupled Riccati equations or coupled Hamilton-Jacobi equations. However, using that
approach, players cannot change their objectives online in real time without calling for a completely new offline solution
for the new strategies. Therefore, in this paper we give a method for learning optimal team strategies online in real
time as team dynamical play unfolds. In the linear quadratic regulator case, for instance, the method learns the coupled
Riccati equations solution online without ever solving the coupled Riccati equations. This allows for truly dynamical team
decisions where objective functions can change in real time and the system dynamics can be time-varying.
1. Introduction

U.S. Army Training and Doctrine Command (TRADOC) Pamphlet 525-66 identifies Force
Operating Capabilities required for the Army to fulfill its mission for a networked
Warfighter concept. Two such capabilities are Battle Command and Battle-Space
Awareness, for which there are expectations that networked teams will perform in a
reliable manner under changing mission requirements and changing team and individual
objectives. These capabilities are necessary in the asymmetric battles waged against
insurgencies, where enemy combatants quickly adapt to Army strategies and tactics [1].
This need is compounded by the fact that insurgents have become increasingly difficult
to detect due to their knowledge of the local terrain and their ability to mix with
civilian populations. Over time, the needs of soldiers change in response to new
insurgent strategies, quickly making existing technologies and systems obsolete. Since
the Department of Defense currently does not have plans for fleet-wide upgrades for
robots [2,3], real-time adaptive team responses to insurgent threats are key to
mitigating risk in uncertain and dynamic battle-spaces.


Battlefield or disaster area teams may be heterogeneous networks consisting of
interacting humans, ground sensors, and unmanned airborne or ground vehicles (UAV,
UGV). Such scenarios call for real-time learning of optimal game strategies under
changing mission requirements and team objectives. This requires adaptive algorithms
for online learning of optimal solutions to multi-player games that keep strategies
updated as team and player objective functions change. Current methods of solving for
optimal game strategies require offline solution of coupled matrix equations, which
does not allow for straightforward updating of decision policies when objectives
change. What is required are deployable dynamic learning algorithms that keep decision
policies current to support mission tailoring, force responsiveness and agility, the
ability to change missions without exchanging forces, general adaptability to changing
battlefield conditions, and defense against ballistic missile attack [4].

This paper has two parts. The first presents an overall view of team behaviors and
dynamical decision-making in teams, both cooperative and competitive. There we discuss
cooperation, collaboration, altruistic vs. selfish behavior, antagonism, competition,
incentives, minimum risk, cheating, and other concepts of multi-player team play. These
concepts and others are rather easy to define clearly in terms of different
objective/payoff functions and/or different optimality criteria. In the second part, we
show how to learn optimal game strategies online in real time by observing data along
the system trajectories as players interact with each other in cooperative or
competitive play.

Strategies for team decision problems, including optimal control, N-player games
(non-zero sum, zero sum) and so on are normally solved offline by solving the coupled
Hamilton-Jacobi (HJ) equations for non-linear systems or coupled Riccati equations for
linear systems. However, using that approach, players cannot change their objectives
online in real time without calling for a completely new offline solution for the new
strategies. Therefore, in the second part of this paper methods are given for solving
different team decision problems online in real time by observing data along the
system trajectories. This provides a truly dynamic framework for team decision-making,
since players or teams can change their objectives or optimality criteria on the fly,
and the new strategies for all players appropriate to the new situation can be
re-computed in real time. This approach also allows for time-varying team dynamics.

The interplay between protagonists and opponents in team play is complex. Often, one's
play is improved if one has to face an adversary. Our definitions of team behaviors in
the first part of the paper were inspired by the summary of Chandler et al. [1] The
discussions on dynamical games are based on Basar and Olsder [5]. A survey of
reinforcement learning (RL) techniques for solving multi-player games is presented in
an award-winning paper by Busoniu et al. [6]

The approach given in the second part of the paper for online learning of optimal game
solutions is based on concepts from RL [7,8]. Every living organism interacts with its
environment and uses those interactions to improve its own actions in order to survive
and increase. Charles Darwin showed that species modify their actions based on
interactions with the environment over long time scales, leading to natural selection
and survival of the fittest. RL refers to an actor or agent that interacts with its
environment and modifies its actions, or control policies, based on stimuli received in
response to its actions. RL implies goal-directed behavior at least insofar as the
agent has an understanding of reward versus lack of reward or punishment. Using a form
of RL known as policy iteration [8], we develop an algorithm for online learning of the
solution to the N-player (zero-sum, non-zero-sum) game problem. In this algorithm, the
optimal value of the game and the Nash equilibrium solution are learned in real time as
the players play together in a dynamical system scenario. The non-linear system case is
presented. In the linear quadratic regulator special case, the algorithm learns the
solution to coupled Riccati equations online, without ever actually solving the coupled
Riccati equations.

Simulation examples show that the team learns the correct Nash equilibrium solution.

2. Different objectives and the behaviors of teams

The framework for team behaviors which we present in this paper can be applied to
general non-linear multi-player dynamical systems, in continuous time or discrete time.
We specialize to linear time-invariant (LTI) continuous-time dynamical systems simply
for ease of discussion. Therefore, consider the continuous-time LTI dynamical system
given by

$$\dot{x} = Ax + B_1 u_1 + B_2 u_2 \tag{1}$$

with state $x(t) \in \mathbb{R}^n$ and two control inputs or players
$u_1(t) \in \mathbb{R}^{m_1}$, $u_2(t) \in \mathbb{R}^{m_2}$. The players may be
cooperating or competing.

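Define objective functions for players 1 and 2 respectively as

$$J_1(x(t), u_1, u_2) = \frac{1}{2}\int_t^\infty L_1(x(\tau), u_1(\tau), u_2(\tau))\, d\tau
= \frac{1}{2}\int_t^\infty \left( x^T Q_1 x + u_1^T R_{11} u_1 + u_2^T R_{12} u_2 \right) d\tau \tag{2}$$

$$J_2(x(t), u_1, u_2) = \frac{1}{2}\int_t^\infty L_2(x(\tau), u_1(\tau), u_2(\tau))\, d\tau
= \frac{1}{2}\int_t^\infty \left( x^T Q_2 x + u_1^T R_{21} u_1 + u_2^T R_{22} u_2 \right) d\tau \tag{3}$$

These are infinite horizon performances-to-go starting at time $t$ in state $x(t)$.
They capture information equivalent to the payoff matrices of static games [5].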

In  this  paper  one  considers  problems  of  minimizing  the 
objective  functions.  That  is,  the  objectives  are  considered 
as  costs  to  be  made  small  by  proper  selection  of  the  players’ 
strategies. 
The integrands $L_i(x, u_1, u_2)$ are defined pointwise at a time $t$ in terms of
weighting matrices $Q$ and $R$ and are known (loosely) as Lagrangians, or as utility
functions. They are selected by the players or a higher-level authority depending on
the performance requirements of the system. Interpreting the control inputs as feedback
control strategies $u_1(x)$, $u_2(x)$, also called policies, that depend on the current
system state $x(t)$, then $J_1(x(t), u_1(x), u_2(x))$, $J_2(x(t), u_1(x), u_2(x))$
represent the costs to players 1 and 2 respectively of motion along the system
trajectories given the current strategies starting at time $t$ in state $x(t)$. A
variety of team decision problems and team behaviors can be defined through the choice
of the objective functions.

2.1 Team coordination

We follow fairly closely the list of definitions of Chandler et al. [1] in discussing
different sorts of team behaviors. Coordination is the closest and most cohesive form
of cooperation in teams. There, all team members share a common objective function.
That is,

$$J_1(x(t), u_1, u_2) = J_2(x(t), u_1, u_2)
= \frac{1}{2}\int_t^\infty \left( x^T Q x + u_1^T R_{11} u_1 + u_2^T R_{12} u_2 \right) d\tau \tag{4}$$

All players have the same optimality criterion, namely to minimize the objective
function. This is exactly the standard optimal control problem [9]. Military vehicle
formations and convoys represent scenarios where all team members are obligated to
participate and are bound to all assignments, tasks, or agreements.

The solution to this problem is given by solving the algebraic Riccati equation (ARE)

$$0 = A^T P + PA + Q - P B_1 R_{11}^{-1} B_1^T P - P B_2 R_{12}^{-1} B_2^T P \tag{5}$$

and the feedback control policies are given by

$$u_1 = -K_1 x = -R_{11}^{-1} B_1^T P x, \qquad u_2 = -K_2 x = -R_{12}^{-1} B_2^T P x \tag{6}$$

The Riccati equation solution must be performed offline a priori, and it defines the
feedback policies for all time. Unfortunately, this rules out the possibility of truly
dynamic team behavior since the utility function weighting matrices cannot be changed
on the fly in real time. We show how to fix this in Section 4 through online learning
of the optimal strategies.
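As a concrete illustration of (5) and (6), the following sketch stacks the two players' input matrices into a single input matrix and calls SciPy's `solve_continuous_are`. The helper names `coordination_gains` and `are_residual`, and all numerical values, are assumptions made here for illustration; they do not come from the paper.

```python
import numpy as np
from scipy.linalg import block_diag, solve_continuous_are


def coordination_gains(A, B1, B2, Q, R11, R12):
    """Solve ARE (5) and return P with the gains K1, K2 of (6).

    Stacking B = [B1 B2] and R = blkdiag(R11, R12) gives
    B R^{-1} B^T = B1 R11^{-1} B1^T + B2 R12^{-1} B2^T, so a
    standard ARE solver can be applied directly.
    """
    B = np.hstack([B1, B2])
    R = block_diag(R11, R12)
    P = solve_continuous_are(A, B, Q, R)
    K1 = np.linalg.solve(R11, B1.T @ P)  # u1 = -K1 x
    K2 = np.linalg.solve(R12, B2.T @ P)  # u2 = -K2 x
    return P, K1, K2


def are_residual(A, B1, B2, Q, R11, R12, P):
    """Right-hand side of (5); close to zero when P solves the ARE."""
    S = B1 @ np.linalg.solve(R11, B1.T) + B2 @ np.linalg.solve(R12, B2.T)
    return A.T @ P + P @ A + Q - P @ S @ P


# Hypothetical two-state team in which each player drives one state.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B1 = np.array([[0.0], [1.0]])
B2 = np.array([[1.0], [0.0]])
Q = np.eye(2)
R11 = np.array([[1.0]])
R12 = np.array([[2.0]])
P, K1, K2 = coordination_gains(A, B1, B2, Q, R11, R12)
```

Any change to `Q`, `R11`, or `R12` requires calling the solver again, which is the offline recomputation that the online method of Section 4 is meant to avoid.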

2.2  Team  cooperation  and  collaboration 

Cooperation is a looser form of team cohesiveness whereby each player can have its own
objective function, as given in (2) and (3), in addition to team objective functions.
If each player seeks to minimize their own cost function, the solution to this optimal
control problem is given [5] in terms of the two coupled AREs

$$0 = A_c^T P_1 + P_1 A_c + Q_1 + P_1 B_1 R_{11}^{-1} B_1^T P_1
+ P_2 B_2 R_{22}^{-1} R_{12} R_{22}^{-1} B_2^T P_2 \tag{7}$$

$$0 = A_c^T P_2 + P_2 A_c + Q_2 + P_1 B_1 R_{11}^{-1} R_{21} R_{11}^{-1} B_1^T P_1
+ P_2 B_2 R_{22}^{-1} B_2^T P_2 \tag{8}$$

where the closed-loop system matrix is

$$A_c = A - B_1 K_1 - B_2 K_2 \tag{9}$$

and the optimal feedback policies are given by

$$u_1 = -K_1 x = -R_{11}^{-1} B_1^T P_1 x, \qquad u_2 = -K_2 x = -R_{22}^{-1} B_2^T P_2 x \tag{10}$$
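One offline route to the coupled AREs (7) and (8) is to iterate on Lyapunov equations: fix the gains, evaluate both players' cost matrices for the resulting closed loop, then update each gain from its own cost matrix as in (10). The sketch below assumes that a stabilizing initial pair of gains is available and that the iteration converges, which is not guaranteed for every game. The function name and the scalar test game are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov


def lyapunov_iteration(A, B1, B2, Q1, Q2, R11, R12, R21, R22, K1, K2, iters=50):
    """Iterate Lyapunov solves toward a solution of the coupled AREs (7)-(8)."""
    for _ in range(iters):
        Ac = A - B1 @ K1 - B2 @ K2                      # closed loop (9)
        M1 = Q1 + K1.T @ R11 @ K1 + K2.T @ R12 @ K2     # player 1 running cost
        M2 = Q2 + K1.T @ R21 @ K1 + K2.T @ R22 @ K2     # player 2 running cost
        # solve_continuous_lyapunov(a, q) solves a X + X a^H = q,
        # so this yields Ac^T P + P Ac + M = 0.
        P1 = solve_continuous_lyapunov(Ac.T, -M1)
        P2 = solve_continuous_lyapunov(Ac.T, -M2)
        K1 = np.linalg.solve(R11, B1.T @ P1)            # gain update from (10)
        K2 = np.linalg.solve(R22, B2.T @ P2)
    return P1, P2, K1, K2


# Hypothetical symmetric scalar game: a = -1, b1 = b2 = 1, q1 = q2 = 1,
# r11 = r22 = 1, r12 = r21 = 0. Zero gains are stabilizing because a < 0.
one = np.array([[1.0]])
zero = np.array([[0.0]])
P1, P2, K1, K2 = lyapunov_iteration(
    A=-one, B1=one, B2=one, Q1=one, Q2=one,
    R11=one, R12=zero, R21=zero, R22=one, K1=zero, K2=zero,
)
```

For this symmetric scalar game, substituting $P_1 = P_2 = p$ into (7) gives $3p^2 + 2p - 1 = 0$, whose positive root is $p = 1/3$, so the iterates can be checked against a known value.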

Collaboration is a still looser form of team behavior whereby each player seeks to
optimize their objective function without compromising team task completion. The
spectrum between coordination, cooperation, and collaboration is a continuum that
depends on the closeness of the objective functions of individual players. For example,
the behavior of an Army medic and a wounded Soldier can be modeled by cooperative
solutions, since both have different private objective functions, but share a team
objective to move to a safe zone. A Predator drone, on the other hand, is more likely
to exhibit collaborative behaviors for intelligence, surveillance, and reconnaissance
(ISR), as long as it has enough fuel to fly.

To reflect the greater cohesiveness in cooperation than in collaboration, Chandler et
al. [1] suggest defining a team objective function $J_{team}(x(t), u_1, u_2)$ and then
setting

$$J_1(x(t), u_1, u_2) = (1 - w_1)\, J_{team}(x(t), u_1, u_2) + w_1 J_{1p}(x(t), u_1, u_2) \tag{11}$$

$$J_2(x(t), u_1, u_2) = (1 - w_2)\, J_{team}(x(t), u_1, u_2) + w_2 J_{2p}(x(t), u_1, u_2) \tag{12}$$

where $J_{1p}, J_{2p}$ are private objectives for each player and $w_1, w_2$ are
weightings selected to put more or less emphasis on team objectives as compared with
private objectives.
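In the quadratic setting of (2) and (3), the blend in (11) and (12) amounts to a convex combination of weighting matrices, because the integral is linear in them. The sketch below shows this for the state weighting only. The restriction of the weight to the interval [0, 1] and the function name `blend_weighting` are assumptions made for this illustration.

```python
import numpy as np


def blend_weighting(Q_team, Q_private, w):
    """Weighting matrix of (1 - w) * J_team + w * J_private, as in (11)-(12).

    w = 0 reproduces the pure team objective (coordination);
    w = 1 reproduces the purely private objective.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w is assumed to lie in [0, 1]")
    Q_team = np.asarray(Q_team, dtype=float)
    Q_private = np.asarray(Q_private, dtype=float)
    return (1.0 - w) * Q_team + w * Q_private


# Hypothetical example: the team penalizes both states equally,
# player 1 privately cares only about the first state.
Q_team = np.eye(2)
Q_1p = np.diag([4.0, 0.0])
Q_1 = blend_weighting(Q_team, Q_1p, w=0.25)
```

The same blend would be applied to the control weightings if the team and private objectives differ there as well.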

2.3  Competition  and  conflict:  zero-sum  games 

The resources available for survival or operation are often limited. Different players
or different teams may compete against each other for the same limited resources, such
as bandwidth on communication networks or natural resources such as land. The extreme
case of competitive behavior is when

$$-J_2(x(t), u_1, u_2) = J_1(x(t), u_1, u_2)
= \frac{1}{2}\int_t^\infty \left( x^T Q x + u_1^T R_{11} u_1 - u_2^T R_{12} u_2 \right) d\tau \tag{13}$$

That is, when one player wins, the other loses by the same amount. This is the standard
two-player zero-sum game [5,10]. If both players seek to minimize their respective
costs, the optimal solution is given by the solution to the game (or generalized) ARE

$$0 = A^T P + PA + Q - P B_1 R_{11}^{-1} B_1^T P + P B_2 R_{12}^{-1} B_2^T P \tag{14}$$

with the optimal feedback strategies given by

$$u_1 = -K_1 x = -R_{11}^{-1} B_1^T P x, \qquad u_2 = K_2 x = R_{12}^{-1} B_2^T P x \tag{15}$$

The solution to the two-player zero-sum game also provides the solution to the bounded
$L_2$-gain problem, wherein the control input $u_1(t)$ seeks to guarantee bounds on the
output in the face of a system disturbance given by $u_2(t)$. In this context one
selects $R_{12} = \gamma^2 I$ for a fixed scalar $\gamma > 0$ and the guaranteed bound
is given in terms of $L_2$ function norms as

$$\|z\|_2 \le \gamma \|u_2\|_2 \tag{16}$$

with $z(t)$ the performance output. Under reachability and observability conditions
this solution exists and is unique for large enough $\gamma > 0$.
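For a scalar system the game ARE (14) reduces to a quadratic in $p$, which makes the role of $\gamma$ easy to see. The sketch below assumes $R_{12} = \gamma^2$ as in the bounded $L_2$-gain setting and returns the stabilizing root when one exists. The function name and the numbers used to exercise it are illustrative assumptions.

```python
import math


def scalar_game_are(a, b1, b2, q, r11, gamma):
    """Stabilizing solution of the scalar form of the game ARE (14).

    Scalar GARE: 0 = 2*a*p + q - s*p**2, with
    s = b1**2 / r11 - b2**2 / gamma**2.
    Returns (p, k1, k2) with u1 = -k1*x and u2 = +k2*x as in (15).
    """
    s = b1**2 / r11 - b2**2 / gamma**2
    disc = a * a + q * s
    if disc < 0.0:
        raise ValueError("no real solution: gamma is too small")
    if abs(s) < 1e-12:
        # The quadratic degenerates to a linear equation in p.
        if a >= 0.0:
            raise ValueError("no stabilizing solution for this gamma")
        p = -q / (2.0 * a)
    else:
        # This root places the closed-loop pole at -sqrt(disc).
        p = (a + math.sqrt(disc)) / s
    if p < 0.0:
        raise ValueError("no non-negative stabilizing solution for this gamma")
    return p, b1 * p / r11, b2 * p / gamma**2


# Hypothetical unstable plant (a = 1) with gamma = 2.
p, k1, k2 = scalar_game_are(a=1.0, b1=1.0, b2=1.0, q=1.0, r11=1.0, gamma=2.0)
```

With these numbers, lowering `gamma` to 0.5 makes the discriminant negative and the function raises, which mirrors the statement that a solution exists only for large enough $\gamma$.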


2.4 Decomposition of objective functions into team goals plus conflict of interest goals

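The objective functions of each player can be written as a team average term plus a
conflict of interest term.

For the case of two players one has

$$J_1 = \tfrac{1}{2}(J_1 + J_2) + \tfrac{1}{2}(J_1 - J_2) \equiv J_{team} + J_1^{coi} \tag{17}$$

$$J_2 = \tfrac{1}{2}(J_1 + J_2) + \tfrac{1}{2}(J_2 - J_1) \equiv J_{team} + J_2^{coi} \tag{18}$$

For three players

$$J_1 = \tfrac{1}{3}(J_1 + J_2 + J_3) + \tfrac{1}{3}(J_1 - J_2) + \tfrac{1}{3}(J_1 - J_3) \equiv J_{team} + J_1^{coi} \tag{19}$$

$$J_2 = \tfrac{1}{3}(J_1 + J_2 + J_3) + \tfrac{1}{3}(J_2 - J_1) + \tfrac{1}{3}(J_2 - J_3) \equiv J_{team} + J_2^{coi} \tag{20}$$

$$J_3 = \tfrac{1}{3}(J_1 + J_2 + J_3) + \tfrac{1}{3}(J_3 - J_1) + \tfrac{1}{3}(J_3 - J_2) \equiv J_{team} + J_3^{coi} \tag{21}$$

For N players one may write

$$J_i = \frac{1}{N}\sum_{j=1}^{N} J_j + \frac{1}{N}\sum_{j=1}^{N} (J_i - J_j) \equiv J_{team} + J_i^{coi}, \qquad i = 1, \ldots, N \tag{22}$$

For N-player zero-sum games, the first term is zero, i.e. the players have no goals in
common. The case of zero-sum multi-player games in a competition mode is discussed by
Busoniu et al. [6]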
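The decomposition (22) can be checked numerically on any list of scalar cost values. The sketch below is a minimal illustration; the function name `decompose_costs` and the sample costs are assumptions introduced here.

```python
def decompose_costs(costs):
    """Split each cost J_i into a team average and a conflict-of-interest term.

    Implements (22): J_team = (1/N) * sum_j J_j and
    J_i^coi = (1/N) * sum_j (J_i - J_j), so J_i = J_team + J_i^coi.
    """
    n = len(costs)
    if n == 0:
        raise ValueError("at least one player is required")
    team = sum(costs) / n
    coi = [sum(ci - cj for cj in costs) / n for ci in costs]
    return team, coi


# Hypothetical three-player costs.
team, coi = decompose_costs([3.0, 1.0, 2.0])

# Two-player zero-sum case: the costs sum to zero, so the team term vanishes.
team_zs, coi_zs = decompose_costs([4.0, -4.0])
```

In the zero-sum case each player's cost consists entirely of the conflict of interest term, matching the remark that the players have no goals in common.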

3. Different optimality criteria and the behaviors of teams

The behaviors of teams and individual players change depending on the selection of the
objective functions. Given the multi-player games just described, one can have further
differing team behaviors depending on the prescribed optimality criteria, and also on
the definition of equilibrium point.

3.1 Nash equilibria and myopic self-improvement

Let each player seek to minimize their own objective function. The Nash equilibrium
policy [5] $(u_1^*, u_2^*)$ for a two-player game is defined by the conditions

$$\begin{aligned}
J_1(x, u_1^*, u_2^*) &\le J_1(x, u_1, u_2^*) \\
J_2(x, u_1^*, u_2^*) &\le J_2(x, u_1^*, u_2)
\end{aligned} \tag{23}$$

This means that if either player changes their own strategy while the other does not,
they do worse in terms of having an increased cost. Nash equilibria are stable in the
sense that a single player cannot improve their performance by unilateral actions. Each
player considers only their own selfish cost. Under certain standard conditions [5],
Nash equilibria exist and are unique.

In the context of two-player zero-sum games, the optimality criterion can be expressed
as

$$J_1(x(t), u_1^*, u_2^*) = \frac{1}{2}\min_{u_1}\max_{u_2}
\int_t^\infty \left( x^T Q x + u_1^T R_{11} u_1 - u_2^T R_{12} u_2 \right) d\tau \tag{24}$$

whereby player 1 seeks to minimize the objective function while player 2 seeks to
maximize it. Then the Nash condition can be written as

$$J_1(x, u_1^*, u_2) \le J_1(x, u_1^*, u_2^*) \le J_1(x, u_1, u_2^*) \tag{25}$$
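The saddle-point condition (25) can be checked numerically for a scalar system, where the cost of any stabilizing pair of linear feedback gains has a closed form. The sketch below restricts attention to linear state feedback and assumes $b_1^2/r_{11} > b_2^2/r_{12}$ so that the game ARE (14) has a stabilizing root. Function names and parameter values are illustrative assumptions.

```python
import math


def saddle_solution(a, b1, b2, q, r11, r12):
    """Stabilizing root p of the scalar game ARE (14) and the gains of (15).

    Assumes s = b1**2 / r11 - b2**2 / r12 > 0.
    """
    s = b1**2 / r11 - b2**2 / r12
    p = (a + math.sqrt(a * a + q * s)) / s
    return p, b1 * p / r11, b2 * p / r12


def cost_player1(a, b1, b2, q, r11, r12, k1, k2, x0=1.0):
    """J1 of (13) for u1 = -k1*x and u2 = +k2*x, starting from x(0) = x0."""
    ac = a - b1 * k1 + b2 * k2
    if ac >= 0.0:
        raise ValueError("closed loop is not stable")
    # x(t) = x0 * exp(ac * t), so the integral of x^2 equals x0^2 / (-2 * ac).
    return 0.5 * (q + r11 * k1**2 - r12 * k2**2) * x0**2 / (-2.0 * ac)


params = dict(a=1.0, b1=1.0, b2=1.0, q=1.0, r11=1.0, r12=4.0)
p, k1, k2 = saddle_solution(**params)
j_star = cost_player1(**params, k1=k1, k2=k2)
```

Perturbing `k1` alone should raise the cost above `j_star`, and perturbing `k2` alone should lower it, which is the two-sided inequality in (25). The equilibrium cost itself equals $\tfrac{1}{2} p\, x_0^2$.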


3.2  Pareto  equilibria,  agreements,  and  cheating 

A different sort of optimality criterion is defined by the Pareto equilibrium, which
for two players satisfies the conditions

$$\begin{aligned}
&\text{if } J_1(x, u_1, u_2) < J_1(x, u_1^*, u_2^*), \text{ then } J_2(x, u_1^*, u_2^*) < J_2(x, u_1, u_2) \\
&\text{if } J_2(x, u_1, u_2) < J_2(x, u_1^*, u_2^*), \text{ then } J_1(x, u_1^*, u_2^*) < J_1(x, u_1, u_2)
\end{aligned} \tag{26}$$

This means that if either player adopts a strategy other than the equilibrium, either
they will incur increased costs or other players will. This is an altruistic sense of
equilibrium wherein all players seek to help team members improve their performance.
This particular optimality criterion aligns well with the Army Core Values of loyalty
and selfless service.

Pareto equilibria are not necessarily unique. To obtain a unique equilibrium,
additional side agreements are needed between the players. Moreover, Pareto equilibria
require cooperation between players and an agreement that none will act so as to harm
another. Pareto performance is subject to cheating, defined as a situation where at
least one player does not follow the agreed-upon rules. By cheating, a single player
may be able to improve their performance at the expense of their team mates.

3.3  Antagonistic  behavior 

Another sort of equilibrium is defined by the conditions

$$\begin{aligned}
J_2(x, u_1^*, u_2^*) &\ge J_2(x, u_1, u_2^*) \\
J_1(x, u_1^*, u_2^*) &\ge J_1(x, u_1^*, u_2)
\end{aligned} \tag{27}$$

This means that if player 1 diverges from the equilibrium policy, then player 2 will
have an improved payoff in terms of decreased costs, and vice versa. That is, each
player is interested in harming the other as much as possible. This is a definition of
antagonistic behavior. The uniqueness of such equilibria needs to be established.

3.4  Leader-follower  games  and  team  incentives 

Consider the case of two players and define the equilibrium solution as

$$\begin{aligned}
J_2(x, u_1, u_2^*) &\le J_2(x, u_1, u_2) \quad \text{for a fixed policy } u_1(x) \\
J_1(x, u_1^*, u_2^*) &\le J_1(x, u_1, u_2^*)
\end{aligned} \tag{28}$$

This is a hierarchical decision problem in which player 1 acts as the leader and player
2 as the follower. The objective of lead player 1 is to determine an incentive, through
selection of their policy $u_1(x)$, so that the follower will always play so as to
minimize the leader's cost while seeking to minimize their own. This is known as a
Stackelberg game.

In the case of three or more players, one can have several definitions of equilibrium
point [5]. Consider the definition for three players given by

$$J_2(x, u_1, u_2^*, u_3^*) \le J_2(x, u_1, u_2, u_3^*) \quad \text{for a fixed policy } u_1(x) \tag{29}$$

$$J_3(x, u_1, u_2^*, u_3^*) \le J_3(x, u_1, u_2^*, u_3) \quad \text{for a fixed policy } u_1(x) \tag{30}$$

$$J_1(x, u_1^*, u_2^*, u_3^*) \le J_1(x, u_1, u_2^*, u_3^*) \tag{31}$$

In this situation, players 2 and 3 are followers who adopt a Nash equilibrium with
regard to each other in the followers' subgame for each policy of the leader. The
objective of lead player 1 is to determine an incentive so that the followers, while
seeking to minimize their own costs, will always act to minimize the leader's cost.

Stackelberg strategies have been explored in the context of international terrorism
[11] and show their obvious implications within existing government/military leadership
hierarchies. John Keegan, who wrote a history of men at war in The Face of Battle,
talks about how 'the personal bond between leader and follower lies at the root of all
explanations of what does and does not happen in battle'. This quote eloquently
describes how the incentive for soldiers to follow orders from their superiors can
seriously affect both soldier morale and wartime outcomes.

4. Online learning of optimal team strategies

It has been shown [6] that in cooperative games the agents use the same objective
function and use greedy policies to maximize their common return. Furthermore, a
variety of team and individual player strategies can be defined by suitable selection
of payoff objective functions and suitable definitions of optimality. Normal approaches
to solving for optimal strategies for team decision problems involve offline solution
of equations such as the coupled Riccati equations. However, using that approach,
players cannot change their objectives online in real time without calling for a
completely new offline solution for the new strategies. In this section we show how to
compute optimal team strategies online in real time by learning based on observed data
along the system trajectories. This provides a truly dynamic framework for team
decision-making, since players or teams can change their objectives or optimality
criteria on the fly, and the new strategies for all players appropriate to the new
situation are then re-computed in real time. This online gaming approach also allows
for time-varying team dynamics.

This learning approach is based on RL techniques [7,8]. A survey on multi-agent RL is
presented by Busoniu et al. [6] It is a general method for solving optimal decision
problems for general non-linear dynamical systems, and will be illustrated for the
non-linear two-player game solution (zero-sum or non-zero sum).

4.1 N-player non-linear games

Consider the N-player non-linear time-invariant differential game on an infinite time
horizon

$$\dot{x} = f(x) + \sum_{j=1}^{N} g_j(x)\, u_j \tag{32}$$

where state $x(t) \in \mathbb{R}^n$, controls $u_j(t) \in \mathbb{R}^{m_j}$. Assume that
$f(x)$ is continuously differentiable and $f(0) = 0$ so that $x = 0$ is an equilibrium
point of the system. The cost functionals associated with each player are

$$J_i(x(0), u_1, u_2, \ldots, u_N)
= \int_0^\infty \Big( Q_i(x) + \sum_{j=1}^{N} u_j^T R_{ij} u_j \Big)\, dt
\equiv \int_0^\infty r_i(x(t), u_1, u_2, \ldots, u_N)\, dt, \qquad i \in N \tag{33}$$

where function $Q_i(x) > 0$ is generally non-linear, and $R_{ii} > 0$, $R_{ij} \ge 0$
are symmetric matrices.

Given admissible feedback policies/strategies $u_i(t) = \mu_i(x)$ the value is

$$V_i(x(t), \mu_1, \mu_2, \ldots, \mu_N)
= \int_t^\infty \Big( Q_i(x) + \sum_{j=1}^{N} \mu_j^T R_{ij} \mu_j \Big)\, d\tau
\equiv \int_t^\infty r_i(x(\tau), \mu_1, \mu_2, \ldots, \mu_N)\, d\tau, \qquad i \in N \tag{34}$$

Define the N-player game

$$V_i^*(x(t), \mu_1, \mu_2, \ldots, \mu_N)
= \min_{\mu_i} \int_t^\infty \Big( Q_i(x) + \sum_{j=1}^{N} \mu_j^T R_{ij} \mu_j \Big)\, d\tau, \qquad i \in N \tag{35}$$

By assuming that all of the players have the same hierarchical level, we focus on the
so-called Nash equilibrium that is given by the following definition.

Definition 1. (Nash equilibrium strategies, Basar and Olsder [5]) An N-tuple of
strategies $\{\mu_1^*, \mu_2^*, \ldots, \mu_N^*\}$ with $\mu_i^* \in \Omega_i$,
$i \in N$ is said to constitute a Nash equilibrium solution for an N-player finite game
in extensive form, if the following N inequalities are satisfied for all
$\mu_i \in \Omega_i$, $i \in N$:

$$\begin{aligned}
J_1^* &\triangleq J_1(\mu_1^*, \mu_2^*, \ldots, \mu_N^*) \le J_1(\mu_1, \mu_2^*, \ldots, \mu_N^*) \\
J_2^* &\triangleq J_2(\mu_1^*, \mu_2^*, \ldots, \mu_N^*) \le J_2(\mu_1^*, \mu_2, \ldots, \mu_N^*) \\
&\;\;\vdots \\
J_N^* &\triangleq J_N(\mu_1^*, \mu_2^*, \ldots, \mu_N^*) \le J_N(\mu_1^*, \mu_2^*, \ldots, \mu_N)
\end{aligned} \tag{36}$$

The N-tuple of quantities $\{J_1^*, J_2^*, \ldots, J_N^*\}$ is known as a Nash
equilibrium outcome of the N-player game.

Differential equivalents to each value function are given by the following non-linear
Lyapunov equations

$$0 = r_i(x, u_1, \ldots, u_N) + (\nabla V_i)^T \Big( f(x) + \sum_{j=1}^{N} g_j(x)\, u_j \Big),
\qquad V_i(0) = 0, \quad i \in N \tag{37}$$

where $\nabla V_i = \partial V_i / \partial x \in \mathbb{R}^n$ is the gradient vector
(i.e. the transposed gradient). Then, suitable non-negative-definite solutions to (37)
are the values evaluated using the infinite integral (35) along the system
trajectories. Define the Hamiltonian functions

$$H_i(x, \nabla V_i, u_1, \ldots, u_N) = r_i(x, u_1, \ldots, u_N)
+ (\nabla V_i)^T \Big( f(x) + \sum_{j=1}^{N} g_j(x)\, u_j \Big), \qquad i \in N \tag{38}$$

According to the stationarity conditions, associated feedback control policies are
given by

$$\frac{\partial H_i}{\partial u_i} = 0 \;\Rightarrow\;
\mu_i(x) = -\tfrac{1}{2} R_{ii}^{-1} g_i^T(x)\, \nabla V_i, \qquad i \in N \tag{39}$$

Substituting (39) into (37) one obtains the N coupled HJ equations

$$0 = (\nabla V_i)^T \Big( f(x) - \tfrac{1}{2} \sum_{j=1}^{N} g_j(x) R_{jj}^{-1} g_j^T(x)\, \nabla V_j \Big)
+ Q_i(x) + \tfrac{1}{4} \sum_{j=1}^{N} \nabla V_j^T g_j(x) R_{jj}^{-T} R_{ij} R_{jj}^{-1} g_j^T(x)\, \nabla V_j,
\qquad V_i(0) = 0 \tag{40}$$

These coupled HJ equations are in 'closed-loop' form. The equivalent 'open-loop' form
is

$$0 = \nabla V_i^T f(x) + Q_i(x)
- \tfrac{1}{2} \nabla V_i^T \sum_{j=1}^{N} g_j(x) R_{jj}^{-1} g_j^T(x)\, \nabla V_j
+ \tfrac{1}{4} \sum_{j=1}^{N} \nabla V_j^T g_j(x) R_{jj}^{-T} R_{ij} R_{jj}^{-1} g_j^T(x)\, \nabla V_j,
\qquad V_i(0) = 0 \tag{41}$$

These equations are difficult to solve. An iterative offline solution technique is
given by the Policy Iteration algorithm in the next section, and it is the key to
motivating the control structure for an online adaptive N-player game solution
algorithm. Then it is proven that the 'optimal adaptive' control algorithm converges
online to the solution of the coupled HJ equations (41), while guaranteeing closed-loop
stability.

4.2 Solution of the N-player game using reinforcement learning

The optimal strategies of the N-player game are given in terms of the coupled HJ
equations (41). Unfortunately, the coupled HJ equations (41) are usually intractable to
solve directly. In fact, the coupled HJ equations may not have exact analytic
solutions. One can solve the coupled HJ equations iteratively to obtain a suitable
local smooth solution by using one of several algorithms built on techniques from RL
[8]. One method of RL is known as policy iteration. The following policy iteration
algorithm solves the coupled HJ equations by iterative solution of a far simpler
equation, namely the non-linear Lyapunov-like equations (37).

4.2.1 Policy iteration for N-player games

Start with stabilizing initial policies $\mu_1^0(x), \ldots, \mu_N^0(x)$. Given the
N-tuple of policies $\mu_1^k(x), \ldots, \mu_N^k(x)$, solve for the N-tuple of costs
$V_1^k(x(t)), V_2^k(x(t)), \ldots, V_N^k(x(t))$ using
$$0 = r_i(x, \mu_1^k, \ldots, \mu_N^k)
+ (\nabla V_i^k)^T \Big( f(x) + \sum_{j=1}^{N} g_j(x)\, \mu_j^k \Big),
\qquad V_i^k(0) = 0, \quad i \in N \tag{42}$$

Update the N-tuple of control policies using

$$\mu_i^{k+1} = \arg\min_{u_i} \left[ H_i(x, \nabla V_i^k, u_1, \ldots, u_N) \right], \qquad i \in N \tag{43}$$

which explicitly is

$$\mu_i^{k+1}(x) = -\tfrac{1}{2} R_{ii}^{-1} g_i^T(x)\, \nabla V_i^k, \qquad i \in N \tag{44}$$

A linear version of the previous algorithm is presented by Gajic and Li [12] and
Abou-Kandil et al. [13]

The PI algorithm will be used as the basis for online learning solution techniques for
optimal game strategies in the next section.

4.3 Online gaming solution of the two-player game

The PI algorithm is a sequential algorithm that solves the coupled HJ equations (41)
and finds the optimal strategies (44) for the game. In this section, we develop an
online algorithm for learning the solution to the two-player differential game in real
time. In this algorithm, the two players learn simultaneously as they play together in
a dynamical game. This is in effect an adaptive control algorithm of novel form that
converges to the optimal game solution. The online gaming algorithm is motivated by the
PI algorithm. Consider the non-linear dynamical system given by

$$\dot{x} = f(x) + g(x)\, u + k(x)\, d \tag{45}$$

with state $x(t) \in \mathbb{R}^n$, first control $u(t) \in \mathbb{R}^m$, and second
control $d(t) \in \mathbb{R}^q$. Assume that $f(x)$ is continuously differentiable and
$f(0) = 0$ so that $x = 0$ is an equilibrium point of the system.

The online gaming algorithm is an adaptive learning controller that is based on value
function approximation (VFA) [14-16]. Motivated by Equations (42) and (43) in the PI
algorithm, it uses four approximator structures, which can be considered as neural
networks (NNs). Two NNs learn the current values of the game, i.e. the solution of the
non-linear Lyapunov equations (42) for the current control policies. The other two NNs
learn the two control policies.

That is, one has the estimates of the values and control policies respectively
expressed as

$$\hat{V}_1(x) = \hat{W}_1^T \phi_1(x), \qquad \hat{V}_2(x) = \hat{W}_2^T \phi_2(x) \tag{46}$$

$$u_3(x) = -\tfrac{1}{2} R_{11}^{-1} g^T(x)\, \nabla\phi_1^T\, \hat{W}_3 \tag{47}$$

$$d_4(x) = -\tfrac{1}{2} R_{22}^{-1} k^T(x)\, \nabla\phi_2^T\, \hat{W}_4 \tag{48}$$

Here, the weights of the four NNs are $\hat{W}_1, \hat{W}_2, \hat{W}_3, \hat{W}_4$.
Exactly as in adaptive control, these are matrices of unknown parameters which must be
estimated or tuned by online learning methods. The NN activation functions are
$\phi_1(x)$, $\phi_2(x)$, and $\nabla\phi_1 = \partial\phi_1/\partial x$,
$\nabla\phi_2 = \partial\phi_2/\partial x$ are the Jacobian matrices.

This scheme has the so-called actor-critic structure [8,15,16], whereby the critic NNs
(46) seek to learn the values of the current policies, i.e. the solution to the
non-linear Lyapunov equations (42). The actor NNs (47) and (48), on the other hand,
seek to learn the optimal policies for both players. The main theorem is now given. It
provides the tuning laws for the critic and control NNs that guarantee convergence of
the online gaming algorithm in real time to the Nash equilibrium solution, while
guaranteeing closed-loop stability.

Theorem 1. (Online games) Let the dynamics for the two-player game be given by (45),
and consider the game formulation as analyzed in this section. Let the critic NNs be
given by (46), the first control input be given by actor (first player) NN (47) and the
second control input be given by actor (second player) NN (48). Let tuning for the
first critic NN be provided by

$$\dot{\hat{W}}_1 = -a_1 \frac{\sigma_3}{(\sigma_3^T \sigma_3 + 1)^2}
\left[ \sigma_3^T \hat{W}_1 + Q_1(x) + u_3^T R_{11} u_3 + d_4^T R_{12} d_4 \right] \tag{49}$$

and the second critic NN be provided by

$$\dot{\hat{W}}_2 = -a_2 \frac{\sigma_4}{(\sigma_4^T \sigma_4 + 1)^2}
\left[ \sigma_4^T \hat{W}_2 + Q_2(x) + u_3^T R_{21} u_3 + d_4^T R_{22} d_4 \right] \tag{50}$$

where $\sigma_3 = \nabla\phi_1 (f + g u_3 + k d_4)$ and
$\sigma_4 = \nabla\phi_2 (f + g u_3 + k d_4)$. Let the first actor NN (first player) be
tuned as

$$\dot{\hat{W}}_3 = -a_3 \Big\{ (F_2 \hat{W}_3 - F_1 \bar{\sigma}_3^T \hat{W}_1)
- \tfrac{1}{4} \nabla\phi_1\, g(x) R_{11}^{-T} R_{21} R_{11}^{-1} g^T(x)\, \nabla\phi_1^T\, \hat{W}_3\, m_2^T \hat{W}_2
- \tfrac{1}{4} D_1(x)\, \hat{W}_3\, m_1^T \hat{W}_1 \Big\} \tag{51}$$

and the second actor (second player) NN be tuned as

$$\dot{\hat{W}}_4 = -a_4 \Big\{ (F_4 \hat{W}_4 - F_3 \bar{\sigma}_4^T \hat{W}_2)
- \tfrac{1}{4} \nabla\phi_2\, k(x) R_{22}^{-T} R_{12} R_{22}^{-1} k^T(x)\, \nabla\phi_2^T\, \hat{W}_4\, m_1^T \hat{W}_1
- \tfrac{1}{4} D_2(x)\, \hat{W}_4\, m_2^T \hat{W}_2 \Big\} \tag{52}$$

where $D_1(x) \equiv \nabla\phi_1(x)\, g(x) R_{11}^{-1} g^T(x)\, \nabla\phi_1^T(x)$,
$D_2(x) \equiv \nabla\phi_2(x)\, k R_{22}^{-1} k^T\, \nabla\phi_2^T(x)$,
$m_1 \equiv \sigma_3 / (\sigma_3^T \sigma_3 + 1)^2$,
$m_2 \equiv \sigma_4 / (\sigma_4^T \sigma_4 + 1)^2$,
and $F_1 > 0$, $F_2 > 0$, $F_3 > 0$, $F_4 > 0$ are tuning parameters. Also assume
$Q_1(x) > 0$ and $Q_2(x) > 0$. Then there exists an $N_0$ such that, for the number of
NN hidden layer units $N > N_0$, the closed-loop system state, the critic NN errors
$\tilde{W}_1, \tilde{W}_2$, and the actor NN errors $\tilde{W}_3, \tilde{W}_4$ are
bounded. Moreover, $\hat{V}_1(x)$ and $\hat{V}_2(x)$ converge to the solution to the
coupled HJ equations.

It is important to note that this algorithm learns the solution to the coupled HJ equations (42) and the optimal policies (44) without ever in fact solving either the coupled HJ equations or the Lyapunov equations. In the LQR case, it learns the solution to the coupled Riccati equations online.
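
As an illustration only, the critic tuning law (49) can be read as a normalized gradient-descent step on the squared Bellman residual of player 1. The following Python sketch applies one explicit Euler step of (49) with the regressor held fixed. The function names, the Euler discretization, the step size `dt`, the gain value and all numerical inputs are assumptions made for this example; none of them are taken from the simulations reported in this paper.

```python
import numpy as np


def bellman_residual(W1, sigma3, Q1, u3, d4, R11, R12):
    """Bracketed term of (49): sigma3^T W1 + Q1 + u3^T R11 u3 + d4^T R12 d4."""
    return float(sigma3 @ W1 + Q1 + u3 @ R11 @ u3 + d4 @ R12 @ d4)


def critic_step(W1, sigma3, Q1, u3, d4, R11, R12, a1, dt):
    """One explicit Euler step of the critic law (49).

    The Euler discretization is an assumption of this sketch; the paper
    states the law in continuous time.
    """
    e1 = bellman_residual(W1, sigma3, Q1, u3, d4, R11, R12)
    normalizer = (sigma3 @ sigma3 + 1.0) ** 2
    W1_dot = -a1 * sigma3 / normalizer * e1
    return W1 + dt * W1_dot


# Hypothetical numbers, chosen only to exercise the update.
W1 = np.zeros(3)
sigma3 = np.array([1.0, -0.5, 2.0])  # stands in for grad(phi_1) (f + g u3 + k d4)
Q1 = 2.0                             # state penalty Q1(x) at the current state
u3 = np.array([0.3])
d4 = np.array([-0.2])
R11 = np.array([[2.0]])
R12 = np.array([[2.0]])

e_before = bellman_residual(W1, sigma3, Q1, u3, d4, R11, R12)
W1_next = critic_step(W1, sigma3, Q1, u3, d4, R11, R12, a1=10.0, dt=0.01)
e_after = bellman_residual(W1_next, sigma3, Q1, u3, d4, R11, R12)
print(abs(e_after) < abs(e_before))
```

With a fixed regressor the residual is affine in the weights, so a sufficiently small step shrinks it by the factor 1 - dt·a1·σ3ᵀσ3/(σ3ᵀσ3 + 1)². In the online algorithm the regressor changes along the trajectory, which is where the excitation condition on the signals becomes important.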

5.  Simulations 

5.1 Two-player non-linear system

In  1954,  Colonel  O.G.  Haywood  asserted17  the  little  known 
fact  that  von  Neumann’s  minorant  solution  to  two-player 
zero-sum  games  is  identical  to  the  U.S.  Military  decision 
doctrine  known  as  the  ‘Estimate  of  the  Situation’.  Today, 
two-player  zero-sum  games  are  viewed  as  essential  tools  for 
military  commanders  to  determine  the  optimal  solution  in 
an  uncertain  wartime  situation.18  Two-player  non-zero-sum 
games  are  equally  important  when  developing  strategies 
for  civilian-military  cooperation  in  peacetime  operations.19 
These  examples  illustrate  how  two-player  system  models 
are  relevant  to  military  operations,  even  when  their  usage 
involves  teams  or  groups  rather  than  just  individual  agents. 

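Consider the following affine-in-control, two-player ($u$ and $d$) non-linear system, with a quadratic cost constructed as20,21

$$\dot{x} = f(x) + g(x)u + k(x)d, \qquad x \in \mathbb{R}^2$$

where

$$f(x) = \begin{bmatrix} \cdots \\ -x_2 - 0.5x_1 + 0.25x_2(\cos(2x_1) + 2)^2 + 0.25x_2(\sin(4x_1^2) + 2)^2 \end{bmatrix}$$

$$g(x) = \begin{bmatrix} 0 \\ \cos(2x_1) + 2 \end{bmatrix}, \qquad k(x) = \begin{bmatrix} 0 \\ \sin(4x_1^2) + 2 \end{bmatrix}$$

Select $Q_1 = 2Q_2$, $R_{11} = 2R_{22}$ and $R_{12} = 2R_{21}$, where $Q_2$, $R_{22}$ and $R_{21}$ are identity matrices.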

The optimal value function for the first critic (player 1) is $V_1^*(x) = \frac{1}{2}x_1^2 + x_2^2$ and for the second critic (player 2) is $V_2^*(x) = \frac{1}{4}x_1^2 + \frac{1}{2}x_2^2$.

The optimal control signal for the first player is $u^*(x) = -(\cos(2x_1) + 2)x_2$ and the optimal control signal for the second player is $d^*(x) = -(\sin(4x_1^2) + 2)x_2$.

One selects the NN vector activation function for the critics as $\phi_1(x) = \phi_2(x) = [x_1^2 \;\; x_1x_2 \;\; x_2^2]^T$. Figure 1 shows the critic parameters for the first player, denoted by $\hat{W}_1 = [W_{c1} \;\; W_{c2} \;\; W_{c3}]^T$, obtained using the proposed game algorithm. After convergence at about 150 s one has $\hat{W}_1(t_f) = [0.5015 \;\; 0.0007 \;\; 1.0001]^T$.


Figure 1. Convergence of the critic parameters for the first player


Figure 2. Convergence of the critic parameters for the second player


Figure 3. Evolution of the system states

Figure 2 shows the critic parameters for the second player, denoted by $\hat{W}_2 = [W_{2c1} \;\; W_{2c2} \;\; W_{2c3}]^T$, obtained using the proposed game algorithm. After convergence at about 150 s one has $\hat{W}_2(t_f) = [0.2514 \;\; 0.0006 \;\; 0.5001]^T$.
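
Because the stated optimal values are quadratic and the critic basis is [x1², x1x2, x2²], the ideal critic weights can be read off directly as [0.5, 0, 1] for player 1 and [0.25, 0, 0.5] for player 2. The short Python check below compares these with the converged weights reported above; the variable names are illustrative only.

```python
import numpy as np

# Ideal weights implied by V1* = 0.5*x1^2 + x2^2 and V2* = 0.25*x1^2 + 0.5*x2^2
# on the basis [x1^2, x1*x2, x2^2].
W1_star = np.array([0.5, 0.0, 1.0])
W2_star = np.array([0.25, 0.0, 0.5])

# Converged critic weights reported in the text (at about 150 s).
W1_hat = np.array([0.5015, 0.0007, 1.0001])
W2_hat = np.array([0.2514, 0.0006, 0.5001])

# Largest componentwise deviation from the ideal weights for each player.
err1 = float(np.max(np.abs(W1_hat - W1_star)))
err2 = float(np.max(np.abs(W2_hat - W2_star)))
print(f"max weight error, player 1: {err1:.4f}")
print(f"max weight error, player 2: {err2:.4f}")
```

The largest deviation is 0.0015 for player 1 and 0.0014 for player 2, both in the coefficient of x1².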

The actor parameters for the first player after 150 s converge to the values of $\hat{W}_3(t_f) = [0.5015 \;\; 0.0007 \;\; 1.0001]^T$, and the actor parameters for the second player after 150 s converge to the values of $\hat{W}_4(t_f) = [0.2514 \;\; 0.0006 \;\; 0.5001]^T$. Therefore the actor NN for the first player

$$\hat{u}(x) = -\tfrac{1}{2} R_{11}^{-1} \begin{bmatrix} 0 \\ \cos(2x_1) + 2 \end{bmatrix}^T \begin{bmatrix} 2x_1 & 0 \\ x_2 & x_1 \\ 0 & 2x_2 \end{bmatrix}^T \begin{bmatrix} 0.5015 \\ 0.0007 \\ 1.0001 \end{bmatrix}$$

also converged to the optimal control, and the actor NN for the second player

$$\hat{d}(x) = -\tfrac{1}{2} R_{22}^{-1} \begin{bmatrix} 0 \\ \sin(4x_1^2) + 2 \end{bmatrix}^T \begin{bmatrix} 2x_1 & 0 \\ x_2 & x_1 \\ 0 & 2x_2 \end{bmatrix}^T \begin{bmatrix} 0.2514 \\ 0.0006 \\ 0.5001 \end{bmatrix}$$

also converged to the optimal one.


Figure 4. Optimal value function for player 1


Figure 5. 3D plot of the approximation error for the value function of player 1.


Figure 6. 3D plot of the approximation error for the control of player 1.


Figure 7. 3D plot of the approximation error for the control of player 2.


The  evolution  of  the  system  states  is  presented  in  Figure  3, 
where  one  can  see  how  the  PE  influences  the  states. 

Figure  4  shows  the  optimal  value  function  for  player  1 
(similarly  for  player  2). 

Figure 5 shows the 3D plot of the difference between the approximated value function for player 1 and the optimal one. Player 2 has a similar error. These errors are close to zero, so the algorithm produces good approximations of the actual value functions.

Figure 6 shows the 3D plot of the difference between the approximated control for the first player, obtained using the online algorithm, and the optimal one. This error is close to zero. The same holds for the second player, as shown in Figure 7.
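
The value-function error surface of Figure 5 can be approximated from the reported weights alone. The Python sketch below evaluates the difference between the learned and optimal value functions of player 1 on a grid. The grid range [-2, 2] × [-2, 2] and its resolution are assumptions made for illustration, since the plotting domain is not stated here, and the helper names `phi` and `value` are hypothetical.

```python
import numpy as np


def phi(x1, x2):
    """Critic basis [x1^2, x1*x2, x2^2], stacked along the first axis."""
    return np.stack([x1**2, x1 * x2, x2**2])


def value(W, x1, x2):
    """Value estimate W^T phi(x), evaluated elementwise on arrays x1, x2."""
    return np.tensordot(W, phi(x1, x2), axes=1)


W1_hat = np.array([0.5015, 0.0007, 1.0001])  # converged critic weights, player 1
W1_star = np.array([0.5, 0.0, 1.0])          # weights of V1* = 0.5*x1^2 + x2^2

# Assumed evaluation domain (not taken from the paper).
grid = np.linspace(-2.0, 2.0, 41)
X1, X2 = np.meshgrid(grid, grid)

error = value(W1_hat, X1, X2) - value(W1_star, X1, X2)
max_abs_error = float(np.max(np.abs(error)))
print(f"max |V1_hat - V1*| on the grid: {max_abs_error:.4f}")
```

On this assumed domain the largest error is 0.0092, attained at the corners where x1 and x2 have the same sign, which is small relative to the optimal value of 6 at those points.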